Chapter 6 Chinese Text Processing

In this chapter, we will discuss one of the most important issues in Chinese language/text processing: word segmentation. When we discussed tokenization in @ref{#tokenization}, word tokenization in English was easy because word boundaries in English are clearly delimited by whitespace. Chinese, however, does not put whitespace between characters, which makes word tokenization a serious problem.

Specifically, we will introduce jiebaR, the most widely used R library for Chinese word segmentation.

6.1 Chinese Word Segmenter jiebaR

First, if you haven’t installed the library jiebaR yet, please install it:
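The package is available on CRAN, so a standard installation is enough:

```r
## install jiebaR from CRAN (the companion data package jiebaRD
## is pulled in automatically as a dependency)
install.packages("jiebaR")

## load the library
library(jiebaR)
```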

Now let us take a look at a quick example.
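A minimal sketch of the workflow that produces the output below (the sample sentence is reconstructed from the tagged output shown later in this chapter):

```r
library(jiebaR)

## initialize a segmenter with the default dictionaries
seg1 <- worker()

## a news sentence used as the running example in this chapter
text <- "綠黨桃園市議員王浩宇爆料,指民眾黨不分區被提名人蔡壁如、黃瀞瑩,在昨(6)日才請辭是為領年終獎金。台灣民眾黨主席、台北市長柯文哲7日受訪時則說,都是按流程走,不要把人家想得這麼壞。"

## segment the text into a character vector of words
segment(text, seg1)
```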

##  [1] "綠黨"     "桃園市"   "議員"     "王浩宇"   "爆料"     "指民眾"  
##  [7] "黨"       "不"       "分區"     "被"       "提名"     "人"      
## [13] "蔡壁如"   "黃"       "瀞"       "瑩"       "在昨"     "6"       
## [19] "日"       "才"       "請辭"     "是"       "為領"     "年終獎金"
## [25] "台灣民眾" "黨"       "主席"     "台北"     "市長"     "柯文"    
## [31] "哲"       "7"        "日"       "受訪"     "時則"     "說"      
## [37] "都"       "是"       "按"       "流程"     "走"       "不要"    
## [43] "把"       "人家"     "想得"     "這麼"     "壞"

To segment a text, you first initialize a segmenter, seg1, with worker(), and then use this segmenter to segment() texts.

There are many parameters you can specify when you initialize the segmenter with worker(). You can find more details in the documentation (?worker). Some of the important arguments include:

  • user = ...: the path to a user-defined dictionary
  • stop_word = ...: the path to a stopword list
  • symbol = FALSE: whether to return symbols (the default is FALSE)
  • bylines = FALSE: whether to return the results line by line, i.e., one list element per line of input (the default is FALSE)
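For example, a worker combining these options might be initialized as follows (the dictionary and stopword paths here are hypothetical placeholders):

```r
library(jiebaR)

seg <- worker(user      = "dict/user_dict.txt",  # user-defined dictionary (hypothetical path)
              stop_word = "dict/stopwords.txt",  # stopword list (hypothetical path)
              symbol    = FALSE,                 # drop punctuation symbols
              bylines   = TRUE)                  # return one list element per input line
```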

From the examples above, it is clear that some words are not correctly identified by the current segmenter: 民眾黨, 不分區, 黃瀞瑩, 柯文哲. It is always recommended to include a user-defined dictionary when doing word segmentation, because different corpora may have their own unique vocabulary.
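A sketch of the improved run, assuming the missing words have been saved in a user dictionary file (the path is hypothetical):

```r
library(jiebaR)

## user_dict.txt lists words such as 民眾黨, 不分區, 黃瀞瑩, 柯文哲 (one per line)
seg2 <- worker(user = "dict/user_dict.txt")

text <- "綠黨桃園市議員王浩宇爆料,指民眾黨不分區被提名人蔡壁如、黃瀞瑩,在昨(6)日才請辭是為領年終獎金。台灣民眾黨主席、台北市長柯文哲7日受訪時則說,都是按流程走,不要把人家想得這麼壞。"

segment(text, seg2)
```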

##  [1] "綠黨"     "桃園市"   "議員"     "王浩宇"   "爆料"     "指"      
##  [7] "民眾黨"   "不分區"   "被"       "提名"     "人"       "蔡壁如"  
## [13] "黃瀞瑩"   "在昨"     "6"        "日"       "才"       "請辭"    
## [19] "是"       "為領"     "年終獎金" "台灣"     "民眾黨"   "主席"    
## [25] "台北"     "市長"     "柯文哲"   "7"        "日"       "受訪"    
## [31] "時則"     "說"       "都"       "是"       "按"       "流程"    
## [37] "走"       "不要"     "把"       "人家"     "想得"     "這麼"    
## [43] "壞"

The format of the user-defined dictionary is one word per line, and the expected encoding of the dictionary is UTF-8. Please note that on Windows, the default encoding of a .txt file created by Notepad may not be UTF-8.
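For instance, a minimal user dictionary covering the words missed in the first run might look like this (saved as a UTF-8 text file, one word per line):

```
民眾黨
不分區
黃瀞瑩
柯文哲
```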

Creating a user-defined dictionary from scratch may take a lot of time. You may consult 搜狗詞庫 (the Sogou dictionary collection), which includes many domain-specific dictionaries created by others. Note, however, that these dictionaries come in the .scel format; you may need to convert the .scel files to .txt before using them in jiebaR. To do the conversion automatically, please consult the library cidian.

When you initialize the segmenter, you can also specify a stopword list, i.e., a list of words you do not want to include in later analyses.
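A sketch of such a run (the stopword file path is hypothetical; judging from the output below, the list would contain function words such as 日, 是, and 都, one per line):

```r
library(jiebaR)

seg3 <- worker(stop_word = "dict/stopwords.txt")  # stopword list (hypothetical path)

text <- "綠黨桃園市議員王浩宇爆料,指民眾黨不分區被提名人蔡壁如、黃瀞瑩,在昨(6)日才請辭是為領年終獎金。台灣民眾黨主席、台北市長柯文哲7日受訪時則說,都是按流程走,不要把人家想得這麼壞。"

segment(text, seg3)
```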

##  [1] "綠黨"     "桃園市"   "議員"     "王浩宇"   "爆料"     "指民眾"  
##  [7] "黨"       "不"       "分區"     "被"       "提名"     "人"      
## [13] "蔡壁如"   "黃"       "瀞"       "瑩"       "在昨"     "6"       
## [19] "才"       "請辭"     "為領"     "年終獎金" "台灣民眾" "黨"      
## [25] "主席"     "台北"     "市長"     "柯文"     "哲"       "7"       
## [31] "受訪"     "時則"     "說"       "按"       "流程"     "走"      
## [37] "不要"     "把"       "人家"     "想得"     "這麼"     "壞"

So far we have not seen the part-of-speech tags provided by the word segmenter. If you need the tags of the words, you need to specify this when you initialize the worker().
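This is done through the type argument of worker(), e.g., worker("tag"). Segmenting with such a worker returns a named character vector, with the POS tags as the names and the words as the values:

```r
library(jiebaR)

## a tagging worker: the first argument of worker() is the worker type
tagger <- worker("tag")

text <- "綠黨桃園市議員王浩宇爆料,指民眾黨不分區被提名人蔡壁如、黃瀞瑩,在昨(6)日才請辭是為領年終獎金。台灣民眾黨主席、台北市長柯文哲7日受訪時則說,都是按流程走,不要把人家想得這麼壞。"

## returns a named character vector: tags as names, words as values
segment(text, tagger)
```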

##          n         ns          n          x          n          n          x 
##     "綠黨"   "桃園市"     "議員"   "王浩宇"     "爆料"       "指"   "民眾黨" 
##          x          p          v          n          x          x          x 
##   "不分區"       "被"     "提名"       "人"   "蔡壁如"   "黃瀞瑩"     "在昨" 
##          x          d          v          x          n          x          x 
##        "6"       "才"     "請辭"     "為領" "年終獎金"     "台灣"   "民眾黨" 
##          n         ns          n          x          x          v          x 
##     "主席"     "台北"     "市長"   "柯文哲"        "7"     "受訪"     "時則" 
##         zg          p          n          v         df          p          n 
##       "說"       "按"     "流程"       "走"     "不要"       "把"     "人家" 
##          x          r          a 
##     "想得"     "這麼"       "壞"

The following table lists the annotations of the POS tagsets used in jiebaR:

You can check the dictionaries being used in your current environment:
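jiebaR provides show_dictpath() for this; listing the files in that directory shows the bundled dictionaries (the output below comes from one particular installation, so your path will differ):

```r
library(jiebaR)

show_dictpath()        # print the path of the default dictionary directory
dir(show_dictpath())   # list the dictionary files shipped with jiebaRD
```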

## [1] "/Library/Frameworks/R.framework/Versions/3.5/Resources/library/jiebaRD/dict"
##  [1] "backup.rda"      "hmm_model.utf8"  "hmm_model.zip"   "idf.utf8"       
##  [5] "idf.zip"         "jieba.dict.utf8" "jieba.dict.zip"  "model.rda"      
##  [9] "README.md"       "stop_words.utf8" "user.dict.utf8"
##  [1] "\""  "."   "。"  ","   "、"  "!"  "?"  ":"  ";"  "`"   "﹑"  "•"  
## [13] """  "^"   "…"   "‘"   "’"   "“"   "”"   "〝"  "〞"  "~"   "\\"  "∕"  
## [25] "|"   "¦"   "‖"   "— " "("   ")"   "〈"  "〉"  "﹞"  "﹝"  "「"  "」" 
## [37] "‹"   "›"   "〖"  "〗"  "】"  "【"  "»"   "«"   "』"  "『"  "〕"  "〔" 
## [49] "》"  "《"

6.2 Case Study 1: Word Frequency and Wordcloud

6.3 Case Study 2: Patterns

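The tagged string below can be produced by, for example, tagging with a worker that keeps the symbols and then pasting each word to its tag (a sketch under these assumptions, not necessarily the exact code used here):

```r
library(jiebaR)

## keep punctuation (symbol = TRUE) so the original sentence structure is preserved
tagger_sym <- worker("tag", symbol = TRUE)

text <- "綠黨桃園市議員王浩宇爆料,指民眾黨不分區被提名人蔡壁如、黃瀞瑩,在昨(6)日才請辭是為領年終獎金。台灣民眾黨主席、台北市長柯文哲7日受訪時則說,都是按流程走,不要把人家想得這麼壞。"

tagged <- segment(text, tagger_sym)

## glue each word to its tag with "_" and rejoin into one string
paste(paste(tagged, names(tagged), sep = "_"), collapse = " ")
```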
## [1] "綠黨_n 桃園市_ns 議員_n 王浩宇_x 爆料_n ,_x 指民眾_x 黨_n 不_d 分區_n 被_p 提名_v 人_n 蔡壁如_x 、_x 黃_zg 瀞_x 瑩_zg ,_x 在昨_x (_x 6_x )_x 日_m 才_d 請辭_v 是_v 為領_x 年終獎金_n 。_x 台灣民眾_x 黨_n 主席_n 、_x 台北_ns 市長_n 柯文哲_n\r 7_x 日_m 受訪_v 時則_x 說_zg ,_x 都_d 是_v 按_p 流程_n 走_v ,_x 不要_df 把_p 人家_n 想得_x 這麼_r 壞_a 。_x"
## [1] "我_r 是_v 在_p 測試_vn 一個_m 句子_n"
## $doc_id
## [1] "character"
## 
## $IPU
## [1] "character"
## 
## $IPU_tag
## [1] "character"

For more information on the Unicode ranges for punctuation in CJK languages, please see this SO discussion thread.